
    Laplace's rule of succession in information geometry

    Laplace's "add-one" rule of succession modifies the observed frequencies in a sequence of heads and tails by adding one to the observed counts. This improves prediction by avoiding zero probabilities and corresponds to a uniform Bayesian prior on the parameter. The canonical Jeffreys prior corresponds to the "add-one-half" rule. We prove that, for exponential families of distributions, such Bayesian predictors can be approximated by taking the average of the maximum likelihood predictor and the sequential normalized maximum likelihood predictor from information theory. Thus, in this case, it is possible to approximate Bayesian predictors without the cost of integrating or sampling in parameter space.

    Bandit Online Optimization Over the Permutahedron

    The permutahedron is the convex polytope whose vertex set consists of the vectors $(\pi(1),\dots,\pi(n))$ for all permutations (bijections) $\pi$ over $\{1,\dots,n\}$. We study a bandit game in which, at each step $t$, an adversary chooses a hidden weight vector $s_t$, and a player chooses a vertex $\pi_t$ of the permutahedron and suffers an observed loss of $\sum_{i=1}^n \pi_t(i)\, s_t(i)$. A previous algorithm, CombBand of Cesa-Bianchi et al. (2009), guarantees a regret of $O(n\sqrt{T \log n})$ for a time horizon of $T$. Unfortunately, CombBand requires at each step an approximation of an $n$-by-$n$ matrix permanent to within an accuracy that improves as $T$ grows, resulting in a total running time that is superlinear in $T$, making it impractical for large time horizons. We provide an algorithm of regret $O(n^{3/2}\sqrt{T})$ with total time complexity $O(n^3 T)$. The ideas combine CombBand with a recent algorithm by Ailon (2013) for online optimization over the permutahedron in the full-information setting. The technical core is a bound on the variance of the "pseudo loss" of the Plackett-Luce noisy sorting process. The bound is obtained by establishing positive semi-definiteness of a family of 3-by-3 matrices generated from rational functions of exponentials of 3 parameters.

    Statistical Mechanics of Linear and Nonlinear Time-Domain Ensemble Learning

    Conventional ensemble learning combines students in the space domain. In this paper, however, we combine students in the time domain and call this time-domain ensemble learning. We analyze, compare, and discuss the generalization performance of time-domain ensemble learning for both a linear model and a nonlinear model. Analyzing within the framework of online learning using a statistical-mechanical method, we show qualitatively different behaviors between the two models. In the linear model, the dynamical behavior of the generalization error is monotonic. We analytically show that time-domain ensemble learning is twice as effective as conventional ensemble learning. Furthermore, the generalization error of the nonlinear model exhibits nonmonotonic dynamical behavior when the learning rate is small. We numerically show that the generalization performance can be improved remarkably by exploiting this phenomenon and the divergence of students in the time domain. Comment: 11 pages, 7 figures

    IR ion spectroscopy in a combined approach with MS/MS and IM-MS to discriminate epimeric anthocyanin glycosides (cyanidin 3-O-glucoside and -galactoside)

    Anthocyanins are widespread in plants and flowers, being responsible for their varied colouring. Two representative members of this family have been selected, cyanidin 3-O-β-glucopyranoside and cyanidin 3-O-β-galactopyranoside, and probed by mass-spectrometry-based methods to test their performance in discriminating between the two epimers. The native anthocyanins, delivered into the gas phase by electrospray ionization, display a comparable drift time in ion mobility mass spectrometry (IM-MS) and a common fragment, corresponding to loss of the sugar moiety, in their collision-induced dissociation (CID) pattern. However, the IR multiple photon dissociation (IRMPD) spectra in the fingerprint range show a feature that is particularly evident in the case of the glucoside. This signature is used to identify the presence of cyanidin 3-O-β-glucopyranoside in a natural extract of pomegranate. In an effort to increase the differentiation between the two epimers, aluminum complexes were prepared and sampled for elemental composition by FT-ICR-MS. CID experiments now display an extensive fragmentation pattern, showing a few product ions peculiar to each species. More noteworthy is the IRMPD behavior in the OH stretching range, which shows significant differences between the spectra of the two epimers. DFT calculations allow interpretation of the observed distinct bands in terms of a varied network of hydrogen bonding and relative conformer stability.

    Byzantine Stochastic Gradient Descent

    This paper studies the problem of distributed stochastic optimization in an adversarial setting where, out of the $m$ machines which allegedly compute stochastic gradients at every iteration, an $\alpha$-fraction are Byzantine and can behave arbitrarily and adversarially. Our main result is a variant of stochastic gradient descent (SGD) which finds $\varepsilon$-approximate minimizers of convex functions in $T = \tilde{O}\big( \frac{1}{\varepsilon^2 m} + \frac{\alpha^2}{\varepsilon^2} \big)$ iterations. In contrast, traditional mini-batch SGD needs $T = O\big( \frac{1}{\varepsilon^2 m} \big)$ iterations, but cannot tolerate Byzantine failures. Further, we provide a lower bound showing that, up to logarithmic factors, our algorithm is information-theoretically optimal both in terms of sample complexity and time complexity.

    On the Prior Sensitivity of Thompson Sampling

    The empirically successful Thompson Sampling algorithm for stochastic bandits has drawn much interest in understanding its theoretical properties. One important benefit of the algorithm is that it allows domain knowledge to be conveniently encoded as a prior distribution to balance exploration and exploitation more effectively. While it is generally believed that the algorithm's regret is low (high) when the prior is good (bad), little is known about the exact dependence. In this paper, we fully characterize the algorithm's worst-case dependence of regret on the choice of prior, focusing on a special yet representative case. These results also provide insights into the general sensitivity of the algorithm to the choice of priors. In particular, with $p$ being the prior probability mass of the true reward-generating model, we prove $O(\sqrt{T/p})$ and $O(\sqrt{(1-p)T})$ regret upper bounds for the bad- and good-prior cases, respectively, as well as matching lower bounds. Our proofs rely on the discovery of a fundamental property of Thompson Sampling and make heavy use of martingale theory, both of which appear novel in the literature, to the best of our knowledge. Comment: Appears in the 27th International Conference on Algorithmic Learning Theory (ALT), 201